[57] ABSTRACT

A computerized method for load balancing in a geographi- 
cally distributed or clustered system is disclosed. An arbiter 
assigns clients to nodes. The arbiter partitions clients into 
groups based on their request load. Each group is dynami- 
cally scheduled among nodes, thus avoiding high load 
groups from being allocated to the same node and overload-
ing the system. If one of the nodes becomes overloaded, an
alarm is generated, so that fewer or no new clients are 
allocated to the overloaded node. 

11 Claims, 9 Drawing Sheets 
COMPUTER SYSTEM AND METHOD FOR LOAD BALANCING WITH SELECTIVE CONTROL

CROSS-REFERENCE TO RELATED PATENT APPLICATIONS

The present application is related to copending application serial
number (to be assigned) filed of even date, entitled "[...] USING A
GENERALIZED TCP ROUTER" by Daniel M. Dias et al., application
Ser. No. 08/861,749 filed May 22, 1997 (Attorney docket number
Y0996-225).

FIELD OF THE INVENTION

The present invention relates generally to an improved computer
system and method for providing load balancing. A more particular
aspect is related to a system and method for load balancing in a
geographically distributed or clustered system including a set of
computing nodes, a subset of which can handle a client request, and
wherein an arbiter mechanism assigns sets of clients to nodes, and
wherein sets of clients are categorized into groups based on an
associated request load, and wherein groups are dynamically
scheduled among the nodes.

BACKGROUND

The traffic on the World Wide Web is increasing exponentially,
especially at popular (hot) sites. In addition to growing the
capacity of hot sites by clustering nodes at that site, additional,
geographically distributed (replicated) sites are often added.
Adding geographically distributed sites can provide for both added
capacity and disaster recovery. The set of geographically
distributed, and replicated, sites is made to appear as one entity
to clients, so that the added capacity provided by the set of sites
is transparent to clients. This can be provided by an arbiter that
assigns clients to sites. In order to support a load that increases
close to linearly with the total capacity of the set of sites, it is
important that the client load be balanced among the sites. Thus
there is a need for methods used by the arbiter for balancing the
load among the sites.

One known method in the art that attempts to balance the load among
such geographically distributed replicated sites is known as the
Round-Robin Domain Name Server (RR-DNS) approach. The basic domain
name server (DNS) method is described in the paper by Mockapetris,
P., entitled "Domain Names - Implementation and Specification", RFC
1035, USC Information Sciences Institute, November 1987. In the
paper by Katz, E., Butler, M., and McGrath, R., entitled "A Scalable
HTTP Server: The NCSA Prototype", Computer Networks and ISDN
Systems, Vol. 27, 1994, pp. 68-74, round-robin DNS (RR-DNS) is used
to balance the load across a set of web server nodes. In this
approach, the set of distributed sites is represented by one URL
(e.g. www.hotsite.com); a cluster subdomain for this distributed
site is defined with its subdomain name server. This subdomain name
server maps client name resolution requests to different IP
addresses in the distributed cluster. In this way, subsets of the
clients will be pointed to each of the geographically distributed
sites. Load balancing support using DNS is also described in the
paper by Brisco, T., "DNS Support for Load Balancing", RFC 1794,
Rutgers University, April 1995.

A key problem with RR-DNS is that it may lead to poor load balance
among the distributed sites. See, for example, Dias, D. M., Kish,
W., Mukherjee, R., and Tewari, R., "A Scalable and Highly Available
Web Server", Proc. 41st IEEE Computer Society Intl. Conf. (COMPCON)
1996, Technologies for the Information Superhighway, pp. 85-92,
February 1996. The problem is due to caching of the association
between name and IP address at various name servers in the network.
Thus, for example, for a period of time (time-to-live) all new
clients behind an intermediate name server in the network will be
pointed to just one of the sites.

One known method to solve this problem within a local cluster of
nodes, i.e., at a single site, uses a so-called TCP router as
described in: Attanasio, Clement R. and Smith, Stephen E., "A
Virtual Multi-Processor Implemented by an Encapsulated Cluster of
Loosely Coupled Computers", IBM Research Report RC 18442, 1992; see
also U.S. Pat. No. 5,371,852, issued Dec. 6, 1994, by Attanasio et
al., entitled "Method and Apparatus for Making a Cluster of
Computers Appear as a Single Host," which are hereby incorporated by
reference in their entirety. Here, only the address of the TCP
router is given out to clients; the TCP router distributes incoming
requests among the nodes in the cluster, either in a round-robin
manner, or based on the load on the nodes. As noted, the TCP router
method as described in these papers only applies to a local cluster
of nodes. More specifically, the TCP router can act as a proxy,
where the requests are sent to a selected node, and the responses go
back to the TCP router and then to the client. This proxy mode of
operation can lead to the router becoming a bottleneck. Also,
because of the extra network hops, both for incoming and response
packets, it is not suitable for a geographically distributed
environment. In another mode of operation, which we will refer to as
the forwarding mode, client requests are sent to a selected node,
and the responses are sent back to the client directly from the
selected node, bypassing the router. In many environments, such as
the World Wide Web (WWW), the response packets are typically much
larger than the incoming packets from the client; bypassing the
router on this response path is thus critical. However, the TCP
router method in forwarding mode only applies to a cluster of nodes
that are connected directly to the router by a LAN or a switch,
i.e., the nodes in the multi-node cluster cannot be geographically
remote, or even on a different sub-net. The reason is that lower
level physical routing methods are used to accomplish this method.

Thus there is a need to provide a method for better load balancing
among geographically distributed sites.

SUMMARY

In accordance with the aforementioned needs, the present invention
is directed to an improved system and method for load balancing
among geographically distributed sites.

Another aspect of the present invention provides an improved system
and method for geographical load balancing in the Internet and the
World Wide Web.

Yet another aspect of the present invention minimizes the overhead
for load balancing as compared to that for serving the client
requests.

These and further advantages are achieved by this invention by
identifying sources with heavy client loads, and those with lighter
loads. More generally, several load tiers may be identified, each
with a comparable load. Each of the tiers is then scheduled
separately, so as to better balance the load among the nodes.
Further, if one of the nodes becomes overloaded, an alarm may be
generated, so that fewer (or no) new clients are allocated to the
overloaded node.

More specifically, a method in accordance with the present invention
includes the following: As outlined above,
the basic problem with the RR-DNS approach is that
gateways/firewalls/name-servers cache the name to IP address
mapping, and all the new requests from behind a gateway are mapped
to the same server site for a so-called time-to-live (TTL) period.
We will refer to the load arriving from behind the gateway during
the TTL period as the hidden load. The basis of the disclosed method
is to estimate the hidden load, and to perform the name to IP
address mapping based on the estimated hidden load. Gateways are
assigned to tiers based on the hidden load. Requests from each tier
are mapped to sites using a round robin technique. The basic idea in
this technique is to distribute the load from gateways with
comparable hidden load among the sites; this leads to better
distribution of the hidden load among sites.

BRIEF DESCRIPTION OF THE DRAWINGS

These, and further, objects, advantages, and features of the
invention will be more apparent from the following detailed
description and the appended drawings, wherein:

FIG. 1A is a diagram of an Internet network having features of the
present invention;

FIG. 1B is a diagram of a generalized network having features of the
present invention;

FIG. 2 is an overall block diagram of a preferred embodiment of the
arbitrator;

FIG. 3 is a flow chart of the alarm/recovery request handler of the
arbitrator;

FIG. 4 is a flow chart of the mapping request handler of the
arbitrator;

FIG. 5 is a flow chart of the update routine used by the mapping
request handler;

FIG. 6 is a flow chart of the statistic collector of the arbitrator;

FIG. 7 is a flow chart of an embodiment of the server; and

FIG. 8 is a flow chart of the check utilization routine used by the
server.

DETAILED DESCRIPTION

FIG. 1A depicts a block diagram of the Internet including features
of the present invention. As depicted, client workstations or
personal computers (PCs) 50 are connected through a common gateway
52 to the network 64. Communications utilize the TCP/IP suite of
protocols. Clients 50 request services from one or more servers 54
which are also connected to the network 64. Typical service requests
include world-wide-web (WWW) page accesses, remote file transfers,
electronic mail, and transaction support. For certain services, more
than one server may be required, forming a service group (58), to
handle the high traffic requirement. These servers may be located at
geographically distinct locations. In any event, the existence of
the multiple servers is transparent to the clients 50. Clients issue
service requests based on a logical or symbolic name of the server
group. This can be provided by a conventional domain name server
(DNS) which maps the logical or symbolic name into the physical or
IP address of one of the server nodes 54 in the server group 58.
This is done through a mapping request from the clients to the DNS.
The mapping requests are thus distinct from the service requests
which are issued from the clients to the servers. To reduce network
traffic, this mapping request is typically not issued for each
service request. Instead, the result of the mapping request is
cached for a time-to-live (TTL) period. Subsequent service requests
issued during the TTL period will follow the result of the previous
mapping and hence be routed to the same server node. According to
the present invention, an "extended DNS" 62 provides improved load
balance of client 50 service requests among servers 54 in the server
group 58.

FIG. 1B depicts a generalized network having features of the present
invention. As depicted, a source (100) represents any computing node
that can issue mapping and service requests, and arbitrator (110)
represents any computing node that can schedule a mapping request to
one of the server nodes (150). FIG. 1A, based on an Internet, is a
special case of FIG. 1B wherein the arbitrator (110) corresponds to
the extended DNS (62). The arbitrator consists of CPU (115), memory
(116) and storage devices (112). A two-tier round robin scheduler
(148) is employed by the arbitrator to assign/schedule mapping
requests to one of the server nodes. The scheduler is preferably
implemented as computer executable code stored on a computer
readable storage device (112) such as a magnetic or optical disk or
other stable storage medium. As is conventional, the scheduler is
loaded into memory (116) for execution on CPU (115). Those skilled
in the art will appreciate that a generalization to multi-tier
(greater than two) round robin is straightforward.

The round robin scheduler (148) includes several components: an
alarm/recovery request handler (138), a mapping request handler
(140), and a statistic collector (145). These components are
explained in detail with reference to FIGS. 3, 4 and 6,
respectively. Several data structures, i.e., the mapping count table
(125), the service count table (130), and the request ratio table
(135) are also maintained. The operations on these data structures
will be explained with the round robin scheduler components. The
server node (150) can be any computing node that can handle service
requests from the sources (100), such as providing data/object
accesses, file transfers, etc. The server node (150) consists of CPU
(155), memory (160) and storage devices (158). The server node
executes a service request handler (170) to process the service
requests as detailed in FIG. 7.

Referring again to FIG. 1B, let N denote the number of sources 100
and M denote the number of servers 150. Let GW(i,j) be the number of
service requests from source 100i to server 150j in an interval of
given length t, and GD(i) be the number of mapping requests in the
same interval from the source 100i to an arbitrator 110 having
features of the present invention. Thus the service count table
(130) can be represented by GW(i,j), 0<i<N+1, 0<j<M+1; and the
mapping count table (125) can be represented by GD(i), 0<i<N+1. Then
let DW(i) be the average number of service requests from the source
100i per mapping request, i.e., the request ratio table (135) can be
represented by DW(i), 0<i<N+1. Recall that DW(i) was earlier
referred to as the hidden load of each gateway in the Internet
context.

FIG. 2 shows an example of a logic diagram for the arbitrator 110
using a 2-tier round robin process in accordance with the present
invention. The two-tier round robin scheduler (148) partitions its
sources into two groups. In the Internet context, the sources become
gateways 52. The server assignments for mapping requests from the
different groups are handled separately in round robin order. In
other words, each source group is handled as a separate and
independent round robin tier.

For example, assume there are six sources A, B, C, D, E and F and
four servers 1, 2, 3 and 4. Furthermore, A and D are in one group
(the first tier) and B, C, E and F are in the other group (the 2nd
tier). The 2-tier round robin scheme assigns/schedules mapping
requests from A and D (similar for requests from B, C, E and F) in
round robin order to the
servers with no regard to the requests and assignment of the other
group. Assume the mapping request stream arriving at the arbitrator
110 consists of requests from A, B, E, F, A, D, B, C, E, . . . .
Furthermore, assume that the two-tier round robin starts the 1st
tier at server 1 and the 2nd tier at server 3. Then, the server
assignment for the mapping request stream will be 1, 3, 4, 1, 2, 3,
2, 3, 4, . . . . This is because the input substream belonging to
the 1st tier consists of A, A, and D, i.e. the 1st, 5th and 6th
elements of the request stream. These requests get assigned to
server 1, 2 and 3, in round robin order. Similarly, the input
substream belonging to the 2nd tier consists of B, E, F, B, C, E,
i.e. the 2nd, 3rd, 4th, 7th, 8th, and 9th elements of the request
stream. These requests get assigned to server 3, 4, 1, 2, 3, 4, in
round robin order. The 2-tier round robin scheduler (148) maintains
two indices PH and PL on the last server assignments in round robin
order for the two tiers, respectively. The scheduler increments
(modulo the number of servers) the value of PH for selecting the
server for the next assignment of mapping requests from the 1st tier
source group. Similarly, it increments (modulo the number of
servers) the value of PL for selecting the server for the next
assignment of the requests from the 2nd tier source group. In the
previous example, at the end of the processing of the stream
requests, the value of PH (for the first tier) is 3 and the value of
PL (for the second tier) is 4, where the initial values before the
processing are 4 and 2, respectively.

Referring again to FIG. 2, in step 205, the parameters, PH and PL,
are initialized. That is to say for the first tier, the round robin
starts searching for the next server assignment after PH which is
not overloaded; and for the second tier the round robin starts
searching for the next server assignment after PL which is not
overloaded. One can choose the initial value for PH to be 1 and that
for PL as close to one half of M. This staggers the two tiers. Also
a load level threshold (TH) is initialized for determining the tier
classification. As an example, TH can be chosen as half of the
average service requests per mapping request over all sources. A
source 100i with an estimated (hidden) load (i.e. DW(i)) higher than
TH will be assigned to the 1st tier group and a source 100i with an
estimated (hidden) load (DW(i)) less than TH will be assigned to the
2nd tier group.

In step 210, a timer interval (TD) which triggers the collection of
statistics is set to t, say 5 minutes. An array W(j), 0<j<M+1, is
set to 1, and arrays GD(i) and DW(i), 0<i<N+1, are set to zero. The
arbitrator 110 then repeatedly checks for input. In step 215, upon
detection of the arrival of an alarm/recovery request from one of
the servers 150j, the arbitrator 110 executes the alarm/recovery
request handler 138, in step 220. Details of the alarm/recovery
request handler will be described with reference to FIG. 3. In step
225, if a mapping request from source 100i (0<i<N+1) is detected,
the arbitrator 110 invokes the mapping request handler 140, in step
230. Details of the mapping request handler will be described with
reference to FIG. 4. In step 235, if the expiration of the statistic
collection timer interval (TD) is detected, the arbitrator 110
executes the statistic collector routine 145, in step 240. Details
of the statistic collector routine will be described with reference
to FIG. 6.

FIG. 3 shows an example of the alarm/recovery request handler logic.
In step 305, the request type is checked to determine whether it is
an alarm request. If it is found to be an alarm request, in step
310, W(j) is set to zero to indicate that server 150j is in an
overloaded state. Otherwise in step 315, a recovery request is
received and W(j) is set to 1.

FIG. 4 shows an example of the mapping request handler. It
determines the tier group of the source request and then performs
the server assignment (i.e. mapping name to IP address in the
Internet application) using a round robin method according to the
tier of the source group. In step 405, the entry GD(i) in the
mapping count table 125 is incremented by 1. In step 410, the load
(including any hidden load) associated with the source is used to
determine its group/tier for the round robin. Specifically, DW(i) is
tested to determine which tier of the round robin the source 100i
belongs to. In step 415, if DW(i) is larger than the load level
threshold TH, it means the source belongs to the first tier group.
The "update" function (in FIG. 5) is invoked to determine the server
selection (i.e. the mapping of the name to an IP address) where the
function parameter (p) representing the index of the last server
assignment for the tier group under consideration is set to PH.
Otherwise, the source belongs to the 2nd tier group and in step 420
when the function "update" is invoked to determine the server
selection for the 2nd tier group, the function parameter (p)
representing the index of the last server assignment for the 2nd
tier group is set to PL.

FIG. 5 depicts an example of the update function of steps 415 and
420. The update function is invoked with a parameter (p) which
represents the last server assignment for the tier group of the
source request. As depicted, in step 505, the parameter p is
incremented modulo M. In step 510, the state of the server 150p is
tested for overloading. If W(p) is not equal to zero, server p is
selected and the source 100i is notified of the selection, in step
515. If in step 510, W(p)=0, step 505 is repeated to generate
another candidate server.

FIG. 6 shows an example of the statistics collector. As depicted, in
step 605, the arbitrator 110 collects the number of service requests
GW(i,j), 0<i<N+1, from each server 150j, 0<j<M+1. This can be done
by explicitly sending a message for the requested information. In
step 610, after collecting the information from all servers, DW(i),
the average number of service requests per mapping request from
source 100i, is calculated, i.e., the total of GW(i,j) over all
servers divided by GD(i). In step 615, GD(i) is reset to zero.
Finally, in step 620, the timer interval (TD) is reset to t.

FIG. 7 depicts an example of a logic flowchart for a server 150j
processing requests in accordance with the present invention. As
depicted, in step 705, two utilization thresholds, representing
overloading (UH) and returning to normal (UL), are initialized. For
example, one can choose UH to be 90 percent utilization and UL to be
70 percent utilization. In step 710, service request array GW(i,j),
for 0<i<N+1, is initialized to zero. Also, a utilization timer
interval (TW) for checking utilization is initialized to s, say 1
minute, and a state variable (TAG) is set to zero. Note that TAG is
set to one when the server is determined to be overloaded. Server
150j then repeatedly checks for input. In step 715, upon detection
of the arrival of a service request from source 100i, GW(i,j) is
incremented in step 720 and the service request is processed by the
service request handler 170, in step 725. In step 730, if a data
collection request from the arbitrator 110 is detected, server 150j
sends GW(i,j), for 0<i<N+1, to the arbitrator 110, in step 735 and
sets GW(i,j), for 0<i<N+1, to zero, in step 740. In step 745, if the
expiration of the utilization timer interval (TW) is detected, the
server 150j executes a check utilization logic, in step 750. An
example of the check utilization logic will be described in FIG. 8.

FIG. 8 depicts an example of the check utilization logic of step
750. As depicted, in step 805, the server state variable (TAG) is
checked. If TAG is equal to zero, in step 810 the server utilization
is checked. In step 815, if the utilization
exceeds the overload threshold (UH), TAG is set to one and
in step 820, an alarm message is sent to the arbitrator 110. 
In step 840, the utilization (TW) timer interval is reset to s. 
In step 810, if the server utilization is less than threshold 
UH, step 840 is executed. In step 805, if TAG does not equal 
zero, in step 825, the utilization of the server is checked. If 
the utilization is returning to normal, i.e., less than threshold 
UL, in step 830, TAG is set to zero. In step 835, a recovery 
(to normal) message is then sent to the arbitrator 110. 
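
By way of illustration only, the arbitrator logic of FIGS. 3
through 6 might be sketched as follows. The class name, method
names and the dictionary-based tables are hypothetical and are not
part of the disclosure; servers are numbered 1 through M as in the
example above. Two behaviors are assumptions of the sketch rather
than of the described method: an error is raised if every server is
overloaded (the update function as described would simply keep
searching), and a source that issued no mapping request in the
interval keeps its previous DW estimate.

```python
class TwoTierScheduler:
    """Illustrative two-tier round robin arbitrator (hypothetical names)."""

    def __init__(self, num_servers, threshold, start_high=1, start_low=None):
        self.m = num_servers
        self.th = threshold                    # load level threshold TH
        if start_low is None:
            start_low = 1 + num_servers // 2   # stagger the second tier
        # PH and PL hold the index of the LAST assignment in each tier,
        # so each starts one position before its tier's first server.
        self.ph = (start_high - 2) % self.m + 1
        self.pl = (start_low - 2) % self.m + 1
        self.w = {j: 1 for j in range(1, self.m + 1)}  # W(j): 0 = overloaded
        self.gd = {}                           # GD(i): mapping request counts
        self.dw = {}                           # DW(i): estimated hidden load

    def alarm(self, server):
        """FIG. 3, step 310: mark a server as overloaded."""
        self.w[server] = 0

    def recover(self, server):
        """FIG. 3, step 315: mark a server as back to normal."""
        self.w[server] = 1

    def _update(self, p):
        """FIG. 5: advance p (modulo M) to the next non-overloaded server."""
        for _ in range(self.m):
            p = p % self.m + 1
            if self.w[p] != 0:
                return p
        raise RuntimeError("all servers are overloaded")  # sketch assumption

    def map_request(self, source):
        """FIG. 4: count the request, pick the tier, return a server index."""
        self.gd[source] = self.gd.get(source, 0) + 1
        if self.dw.get(source, 0.0) > self.th:
            self.ph = self._update(self.ph)
            return self.ph
        self.pl = self._update(self.pl)
        return self.pl

    def collect(self, service_counts):
        """FIG. 6: service_counts[i] is GW(i, j) summed over all servers."""
        for source, mappings in self.gd.items():
            if mappings:
                self.dw[source] = service_counts.get(source, 0) / mappings
            self.gd[source] = 0


# The six-source, four-server example from the description:
sched = TwoTierScheduler(num_servers=4, threshold=10.0)
sched.dw = {"A": 50.0, "D": 40.0}   # A and D form the first tier
assignments = [sched.map_request(s) for s in "ABEFADBCE"]
print(assignments)        # -> [1, 3, 4, 1, 2, 3, 2, 3, 4]
print(sched.ph, sched.pl) # -> 3 4
```

Keeping one index per tier is what makes the tiers independent: a
burst of requests from lightly loaded sources advances only PL and
does not disturb where the next heavily loaded source is placed.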

Those skilled in the art will readily appreciate that various 
extensions to the disclosed scheme can be used. For 
example: The case with two tiers is described in the embodi- 
ment above. Alternatively, partitioning the tiers using per- 
centiles of the hidden load could be used. Another technique 
is to use recursive partitioning starting with the mean, and 
recursively resplitting the tier having the higher load. A 
weighted round robin technique within each tier can be used 
with weights based on the capacity of each site. This would 
handle heterogeneous or clustered sites having different 
CPU MIPS or numbers of CPUs. 
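
As one possible reading of the recursive partitioning mentioned
above, the following sketch splits the sources around the mean
hidden load and then re-splits the heavier tier. The function names
and the "splits" parameter are hypothetical; ties at the mean are
placed in the lighter tier, which is an arbitrary choice of the
sketch.

```python
def split_by_mean(loads):
    """Split {source: hidden_load} into (heavier, lighter) around the mean."""
    mean = sum(loads.values()) / len(loads)
    high = {s: v for s, v in loads.items() if v > mean}
    low = {s: v for s, v in loads.items() if v <= mean}
    return high, low


def recursive_tiers(loads, splits):
    """Return a list of tiers ordered from heaviest to lightest.

    The heaviest tier is re-split around its own mean up to `splits`
    times; splitting stops early when a tier cannot be divided further.
    """
    lighter = []
    current = dict(loads)
    for _ in range(splits):
        if len(current) < 2:
            break
        high, low = split_by_mean(current)
        if not high:          # all loads equal: nothing left to separate
            break
        lighter.append(low)
        current = high
    return [current] + lighter[::-1]


tiers = recursive_tiers({"A": 100, "B": 60, "C": 10, "D": 8, "E": 2}, splits=2)
print([sorted(t) for t in tiers])  # -> [['A'], ['B'], ['C', 'D', 'E']]
```

Each resulting tier would then be given its own round robin index,
generalizing PH and PL to one index per tier.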

Another extension is to stagger the round robin assign- 
ments among tiers, such that different tiers start at different 
sites. Another extension is to use different round robin orders 
among the different tiers, to avoid convoy effects where the 
different tiers move together because of similar TTLs. For
instance, a pseudo random round robin order for each tier 
could be used. 
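
A minimal sketch of such a per-tier pseudo random order is given
below, assuming each tier simply cycles through its own fixed
permutation of the server indices. The function names and the seed
parameter are hypothetical.

```python
import random


def tier_orders(num_servers, num_tiers, seed=0):
    """Build one pseudo random permutation of server indices per tier."""
    rng = random.Random(seed)   # seeded so the orders are reproducible
    orders = []
    for _ in range(num_tiers):
        order = list(range(1, num_servers + 1))
        rng.shuffle(order)
        orders.append(order)
    return orders


def cycle_tier(order, count):
    """Return the first `count` assignments made by cycling through `order`."""
    return [order[k % len(order)] for k in range(count)]


orders = tier_orders(num_servers=4, num_tiers=2, seed=1)
for tier, order in enumerate(orders, start=1):
    print(tier, cycle_tier(order, 8))
```

Every server still receives the same share of each tier's
assignments over a full cycle; only the visiting sequence differs
from tier to tier.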

Another extension is to use threshold based alarms based 
on the load at individual sites; if the load at a site crosses a 
pre-defined threshold, the weight of the site can be reduced, 
or the weight can be set to zero thus eliminating the site from 
consideration. A second threshold can be used such that the 
site gets back its normal weight when the load falls below a 
lower threshold. Various other extensions to the disclosed 
method can be used and are considered to be in the spirit and 
scope of this invention. 
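
The two-threshold behavior described above can be illustrated with
the following sketch, which uses the example values of 90 percent
(UH) and 70 percent (UL) given for FIG. 7. The class and method
names are hypothetical, and utilization is assumed to be supplied
as a fraction between 0 and 1.

```python
class SiteMonitor:
    """Two-threshold overload detector (hypothetical names)."""

    def __init__(self, high=0.90, low=0.70):
        if low >= high:
            raise ValueError("low threshold must be below high threshold")
        self.high = high          # UH: overload threshold
        self.low = low            # UL: return-to-normal threshold
        self.overloaded = False   # plays the role of TAG

    def check(self, utilization):
        """Return 'alarm', 'recovery', or None for one periodic check."""
        if not self.overloaded and utilization > self.high:
            self.overloaded = True
            return "alarm"
        if self.overloaded and utilization < self.low:
            self.overloaded = False
            return "recovery"
        return None


monitor = SiteMonitor()
print([monitor.check(u) for u in (0.5, 0.95, 0.8, 0.92, 0.6, 0.5)])
# -> [None, 'alarm', None, None, 'recovery', None]
```

The gap between the two thresholds keeps a site whose utilization
hovers near 90 percent from sending alternating alarm and recovery
messages on successive checks.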

We claim: 

1. In a multi-node server environment wherein client 
requests can be satisfied by routing a client request to any 
server, and wherein clients are divided into groups, and 
wherein client groups periodically send requests to an 
arbitrator, a computerized method employed by the arbitra- 
tor for assigning a server to service some or all of the 
requests from a client group, comprising the steps of: 
estimating a load, associated with the requests from client
groups to an assigned server node;
partitioning client groups into tiers, in response to said
step of estimating a load; and
for each tier, separately scheduling the client groups to the
assigned server node.
2. The method of claim 1 wherein each of the multi-node
servers are web servers at geographically distributed sites,
wherein the arbitrator comprises an extended domain name
server (DNS), and wherein the client group consists of the
set of clients behind a common gateway or SOCKS server,
the method comprising the steps of:
said estimating step including estimating the hidden load
behind each gateway;
said partitioning step including assigning gateways to the
tiers; and
said scheduling step comprising mapping name requests
to IP addresses using a separate round robin method for
each tier.

3. The method of claim 2, wherein said estimating step
comprises the step of estimating the hidden load behind the
gateway as a ratio of a total number of page requests from
the gateway to web server sites, to a number of name server
requests from the gateway.

4. The method of claim 2, wherein a different round robin
order is used for each tier.

5. The method of claim 2 wherein said partitioning step
comprises partitioning the gateways into two tiers wherein a
first tier includes gateways having more than a mean hidden
load, and a second tier includes remaining gateways.

6. The method of claim 5, further comprising the step of
recursively splitting at least one of the tiers according to
the mean load within that tier.

7. The method of claim 4 wherein the round robin order
for each tier is pseudo-random.

8. The method of claim 2 wherein the round robin order
for different tiers is staggered, with different sites as starting
points.

9. The method of claim 8 wherein a load threshold is used,
further comprising the steps of:
detecting if the load at a site exceeds the threshold; and
reducing the weight of one or more of the tiers for the site
exceeding the threshold.

10. The method of claim 9, further comprising the step of:
said reducing step including reducing the weight to zero
for the site exceeding the threshold;
detecting if the load at a site having zero weight has fallen
below a second threshold; and
increasing the weight of the site detected to have fallen
below the second threshold.

11. The method of claim 1 wherein said scheduling
step further comprises the step of assigning each tier using
a round-robin assignment.
